Contents

1 Today: Download, QC and Preprocessing of Illumina data:

pipeline flow chart

Figure 1: pipeline flow chart

2 Login commands

2.1 ssh

ssh meds5420usrX@xanadu-submit-ext.cam.uchc.edu

2.2 sftp

sftp meds5420usrX@transfer.cam.uchc.edu

2.3 interactive session

srun  --pty -p mcbstudent --qos=mcbstudent --mem=2G bash

2.4 Installing precompiled binaries

You can see the binaries available for sratoolkit here: https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit.

Precompiled binaries are often made available for 32 bit or 64 bit operating systems. To find out what you have, type:

uname -a
## Darwin guertinlab 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:37 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T6000 x86_64

Next, you can download the software from a website, or if you have the url, you can also use wget.

First, install the wget function if you are on a Mac. If you do not have admin privileges on a Mac, you can use curl.

brew install wget

## FORMAT: wget -output-document <name_of_file> <weblink>
wget --output-document sratoolkit.tar.gz
https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-ubuntu64.tar.gz

Not all Macs will have wget. In this case, you can use curl, as shown below. We will use wget in class, so please install wget with brew.

## FORMAT: curl -O <name_of_file> <weblink>

#make sure you use -output or -O so that the file 
#is saved and not printed to screen

wget --output-document sratoolkit.tar.gz
https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-mac64.tar.gz

How do you know if the download is complete? Recall that you can check the file integrity by using checksum. Checksum is a string of characters that reflect the number of bytes in a file. If the author-provided checksum matches yours after download, they are likely identical If they do not match, they are definitely different and the download should be repeated.

To get the checksum of your file after download use:

md5  <filename> # mac or Linux
md5sum <filename> # Linux only

The checksum values for the sra-toolkit downloads are found here: https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/md5sum.txt

b1c028945a6e54626cbaf25781151dae hisat2-2.2.1-64-ngs.3.0.0-linux.zip
259d8037eab50a7b3d8afdc932a190d3 hisat2-2.2.1-64-ngs.3.0.0-mac.zip
f1e2382ac45087fdc825b84484d4eb4f setup-apt.sh
2bb6291edd8024603ed7aa03fda49e6a setup-yum.sh
2b1de9f0801fbd26724a50be5ea7a9ce sratoolkit.3.0.0-centos_linux64-cloud-dbg.tar.gz
f54fde94c3948ccc246ed40f660f05c7 sratoolkit.3.0.0-centos_linux64-cloud.tar.gz
7f5bda044814a5f2ada727f34dd09529 sratoolkit.3.0.0-centos_linux64.tar.gz
a9a85e032d76c751eee378b080ea65b0 sratoolkit.3.0.0-mac64.tar.gz
7d58f946796e3c587dd473216ebadf8e sratoolkit.3.0.0-ubuntu64.tar.gz
11bda0fd523408b3dde3e82b58ded49f sratoolkit.3.0.0-win64.zip
a7bc4f4691c79596a8598960e676f56f sratoolkit.3.0.0.version.txt


Example checksums

md5 sratoolkit.tar.gz
## MD5 (sratoolkit.tar.gz) = a9a85e032d76c751eee378b080ea65b0

To report only the checksum value to the screen, use the ‘-q’ option:

md5 -q sratoolkit.tar.gz
## a9a85e032d76c751eee378b080ea65b0

It is now safe to unpack the software:

tar -xvzf sratoolkit.tar.gz

#OR
gunzip sratoolkit.tar.gz
tar -xvf sratoolkit.tar

Now try to run the run the command without arguments as you did in the past:

fastq-dump 

This time you are told the command is not found. But you know the program is on the computer, you just downloaded it. The problem is that the computer needs to be told the $PATH where programs or commands should be found.


3 Introduction to search path: $PATH

3.1 Where do we put the software after download?

In order to run a command from the terminal, the computer must know where the program is. You can do this manually, by specifying the full path to the program every time you use it. This is laboroious. To make life easier, each user has a ‘search path’ that tells the computer where to look for commands that are typed into the terminal. To view your path type:

echo $PATH

The search path is a list of paths that are separated by a colon. When you type a command the computer will look in each directory specified in the variable $PATH. If found, there is attempt to run the program. If not, you will get the error: command not found. To get your program to run, you can do any of three things:

1. Use the full path to the program when running it.
2. Move the program into one of the PATHs so that it is automatically recognized by bash.
3. Change the $PATH variable so that it includes the path to the program.

First let’s put the software in an appropriate place:
Many of the built in functions for your computer and some installed ones are in the /bin/ or /usr/bin/ directories. For example try:

which python3
## /usr/bin/python3

and

which grep
## /usr/bin/grep

which tells you where the program is, and these are indeed targets of the PATH variable. However, it isn’t a good habit to add custom installs to these directories because:

-Mistakes in altering directories with built-in functions can lead to corruption of core functions.
- it can be difficult to remove or alter your custom software in these directories.

There is another directory in the PATH that is convenient for custom installs by the user: /usr/local/bin/. It should be empty on new linux machines. Move your sra-toolkit folder there with mv.

Although /usr/local/bin is in the PATH, typing fastq-dump still gives you command not found. This is because the programs are within the sra-toolkit.3.0.0-ubuntu64/bin/ folder. So, if you try the following, the computer should recognize the program:

/usr/local/bin/sratoolkit.3.0.0-mac64/bin/fastq-dump

Since we don’t want to add this cumbersome path everytime we need another solution. We could unload all the programs to the /usr/local/bin/ folder, but since other software will likely go here, things could get crowded. I like to keep them separate, by leaving programs from software packages in their respective folders and then just change my PATH variable to point directly to those folders.

3.2 Changing your path

Temporary path changes:

export PATH=$PATH:$PWD/sratoolkit.3.0.0-mac64/bin

#OR
export PATH=$PATH:/usr/local/bin/sratoolkit.3.0.0-mac64/bin

echo $PATH
## /usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/Users/guertinlab/Documents/class/2023_lec7/sratoolkit.3.0.0-mac64/bin:/usr/local/bin/sratoolkit.3.0.0-mac64/bin

The above command will change your path to the bin where the sratoolkit scripts are. However, this is a temporary change that is only used for the current session. Thus, the altered path will be ignored when starting a new session. To make it permanent, you need to change the path in your bash profile
Your PATH variable is set inside your bash profile. This is a file called .bashrc (Linux) or .bash_profile (Linux or Mac). This file is located in your home directory, but is hidden from normal viewing. On Xanadu, O I empirically determined that .bashrc is not used at login, so modify .bash_profile for software installed on the server.


To see your bash profile, move to your home directory and type ls -a. You can use nano (might need to precede it with sudo), to alter your path in the following way:

sudo nano .bashrc #.bash_profile on Linux
sudo nano .bash_profile #.bash_profile on a Mac

Type this line at the end:

export PATH="$PATH:/usr/local/bin/sratoolkit.3.0.0-ubuntu64/bin/" 

OR

export PATH="$PATH:/usr/local/bin/sratoolkit.3.0.0-mac64/bin/"

This line overwrites the old PATH after appending the new path to the current PATH.
Exit nano and save the file.
Try to see what the new PATH looks like. NOTE: For the changes to take effect, you first need to start a new termial session.

echo $PATH 

The new PATH destination should be reflected and typing fastq-dump from any directory should allow the computer to find it and print the usage to the screen.
Congrats, you have completed a custom install of software binaries!!!

4 What if binaries are not available for your OS?

4.1 Installing from source

Installing from source is becoming less and less common as package managers are gaining in popularity. I tried to find a good example for you to install from source. I found that the most common tools sra-tools, bowtie2, meme, macs3, and bedtools all have different, sometimes complicated, and sometimes poorly documented installation instructions. The only general advice I have is to carefully read the README and INSTALL files.

Fortunately, most packages are available through package managers, preloaded in modules on the server, or available as binaries that are compatible for your system.

This link provides an explanation on what is happening during an install from source:
https://robots.thoughtbot.com/the-magic-behind-configure-make-make-install

5 Server operations: Using module to temporarily load functions:

We’re going to use the sratoolkit, fastqc, and fastx-tools amongst other things today and next time. These programs are on the server, however, we need to load them into our session for use. On our own machine we made sure the programs were in our $PATH. On the server we can use the module to load and have access to specific functions during your session without altering your $PATH. This provides flexibility and specificity in running software versions.
First, view which modules are available:

ssh <user_name>@xanadu-submit-ext.cam.uchc.edu 

module avail  

We can check specifically for sratoolkit:

module avail sratoolkit

Once you find the one you’re looking for:

module load <software>  # sratoolkit/2.8.1  for the sratoolkit  

This is convenient because you can add module functions to your shell scripts to be sure that you are using the correct software versions.

AND
You don’t have to waste time adding lots of programs to your $PATH

It is flexible because it allows you to even switch between version of software on the fly. For instance:

# load software
module load <software_v.2>   

# do something
some commands, etc.
#unload software and load another version
module unload <software_v.2>
module load <software_v.1>

# do something else
more commands, etc.

5.1 Server operations: Running jobs in the background

  • One problem with working on the server is that if you start a command and it takes a while, you’re stuck.
  • To get around this you can run the operation in the background using the ‘&’ symbol.
  • This runs the program in the background, allowing you to continue using your command prompt while the program is executed.
  • To start a command in the background just add & to the end of the command.
  • You should get a process ID number: a unique identifier that allows you to track that operation.

To see what processes are running you can use several commands:

  • top: continuous monitoring of activity
  • ps or ‘process’: shows the activity under your user and only shows that activity at that instance (snapshot).
  • qstat: when you’re on the server.
  • squeue: (covered previously) can show jobs by, user, ID, or node.

5.2 Killing jobs

To stop runaway processes or ones that are taking too long in the background, you cannot use the normal Control-c. You can, however, scancel the process using the process ID number.

# when on the Xanadu server
scancel <process_ID>

You can also run jobs in the background on your own machine. In this case, you can use kill to cancel jobs.

kill <process_ID>

You can review other ways to view activities on the server and kill processes here: http://bioinformatics.uconn.edu/unix-basics/

6 Downloading data from sra archive

6.1 Downloading options with sratoolkit:

prefetch: Downloads .sra file.

  • Works quickly
  • Data must be converted to fastq locally with fastq-dump
  • Can easily check for complete download vdb-validate
  • .sra files contain pre- and post-mapping information and QC.

fastq-dump: Downloads .fastq file.

  • Combines .sra download and simultaneous conversion to .fastq
  • Checksum not automated
  • Can check data integrity by comparing # seqs with output from vdb-dump --info

fasterq-dump: Downloads .fastq

7 Prefetch

This program directly downloads .sra files to you machine.

prefetch <options> <SRR-number> &  

prefetch will typically add a folder called ncbi to your home directory, in which a copy of the sra file will be placed in the following location:

~/ncbi/public/sra/

You should first validate that the download is complete with vdb-validate:

vdb-validate <SRR-number>

sratoolkit maintains a database of checksums for each data set. This will essentially run a checksum on the file (in the ~/ncbi/public/sra/ directory tree) to confirm that the download was complete./

You can then use fastq-dump to convert the .sra file to fastq.

fastq-dump <options> <SRR-number> &  

7.1 Fastq-dump

This program directly downloads .sra files to you machine and converts to fastq on the fly.

fastq-dump <options> <SRR-number> &  

The data should be in whichever directory you specify (default is current directory).

7.2 Let’s try downloading some data.

There are two data sets that we will practice with today.
Their accession #’s are:
SRR039637
and
SRR1552484

Create a /data/ folder within your MEDS5420 home directory on the server.


Start an interactive session and load modules:

#start an interactive session
srun --pty -p mcbstudent --qos=mcbstudent bash

#load module:
module load sratoolkit

#change to /data directory
cd ~/data

Now let’s use prefetch and fastq-dump to download data.

7.3 TRICK with prefetch: string together prefetch with validation and .fastq conversion

prefetch SRR039637 && vdb-validate SRR039637 && fastq-dump --gzip SRR039637

7.4 Fastq-dump only:

fastq-dump -F --gzip -X 20000000 SRR1552484  
  # -F: downloads the original format of the data
  # --gzip: compresses the data upon downloading

then use vdb-dump to get information about the file resulting from fastq-dump

vdb-dump --info SRR1552484  

Compare the number of sequences to number of lines in downloaded data. See how many lines there are in each file using zcat and wc.

Recall that zcat allows you to access gzipped files without actually uncompressing them completely. This can save a lot of time with decompressing/compressing large files.

Can you infer what the -X option specifies?

If you cannot retrieve the files, they are also in the data folder for our class: /home/FCAM/meds5420/data/

#For Linux (server)
zcat SRR039637.fastq.gz | wc -l 
#On a Mac
gzcat -c SRR039637.fastq.gz | wc -l 

You should get 23313288 lines.

8 Interpreting .fastq (.fq) format and Phred scores

8.1 Inspecting data

For today’s exercises, we don’t need all the data, so let’s subset some of it into your home directory for practice.
Subsetting data is generally useful as it allows you to quickly do the following:

1. Get the general idea of how a dataset is constructed
2. Assess the quality of a dataset
3. Debug processing or analysis scripts

Using gzcat again output the first 1000000 lines from SRR039637.fastq.gz into a new file called test1.fq and place it in your data folder.
Example:


gzcat SRR039637.fastq.gz | head -1000000 > ./data/test1.fastq

## If you are are working on the server, use the Linux version of commands
## gzcat: can't stat: SRR039637.fastq.gz (SRR039637.fastq.gz.gz): No such file or directory

Now go to your data folder and we’ll look at the data:

head -8 test1.fastq

.fastq files consist of 4 lines for each sequence. The lines are:

1. A unique identifier for that sequence
2. The sequence itself
3. Often a repeat of the unique identifier, but it can be anything
4. String of phred quality scores

What are phred quality scores?: They give the probability that the base call is correct. This convention was originally developed to aid interpretation of Sanger sequencing traces.

Sanger Q-scores from traces

Figure 2: Sanger Q-scores from traces

The phred quality score (Q) is related to the log of the error probability (E):

Q=-10* log10(E)

This table shows the probabilities associated with different phred scores.

Phred interpretation

Figure 3: Phred interpretation

Illumina phred (Q) scores:

1. In current HTS data, the max quality score is typically 40.
2. These phred scores are represented by ASCII characters that put the scores into a single character per byte (or base).
3. The scores are typcially ASCII +33, in order to avoid odd characters and functions of the first 32 ASCII characters.
4. Illumina originally made their q-scores ASCII +64 so they would correspond to letters, however they switched due to pressure from the rest of the community (that already used ASCII +33).

You can check out the ASCII table here: http://www.asciitable.com/.

ASCII-Phred conversion

Figure 4: ASCII-Phred conversion

9 QC and preprocessing

Two useful Quality control and read processing programs to use before mapping to genomes:
FastQC.
and
Fastx toolkit

Both programs are on the Xanadu server and can be loaded with module.

10 FastQC: quality control for .fastq files

Performs moderately thorough QC and is extremely easy to use.

Concerns for QC:
1. Quality scores (Phred) for each base and average for reads
2. Adapter contamination (no DNA or RNA insert)
3. Duplicate sequences (library complexity)
4. Sequence lengths

Type of common preprocessing manipulations:

1. Remove adapter sequences from reads (fastx clipper)  

2. Trim reads to a defined size (fastx trimmer)
3. View duplicate sequences (fastq collapser)
4. Filter out low quality reads (fastq quality filter)
5. Filter out low quality portions of reads (fastq quality trimmer)

USAGE: For FastQC, it is simple:

#To Look at the usage instructions:
module load fastqc

fastqc -h

#To run an analysis
fastqc  <data.fastq>

Helpful tip: Opening html files from command line.

FastQC creates individual files of the evaluation of each criteria. All of the results are viewable within a browser with the supplied HTML file. You can use the following command format to open fastQCreports.

open -a "/Applications/Google Chrome.app" <FILENAME.html>

However, I typically navigate to the directory and open the Finder.

cd </path/containing/the/html_file> # I am usually already there
open .

# OR

open /path/containing/the/html_file

I often have HTML files on a cluster and I do not view them remotely, so I use sftp to transfer them to my machine and open them locally. This is clunky, but it works.

10.1 In class exercise 1: FastQC

Remember to use an interactive session when running these commands!

fastQC exercise:

1. Use the sratoolkit to download SRR1552484 to your data folder.
2. Write only the first 1000000 lines to a new file: call it test2.fastq
3. Go to the FastQC website. View reports from some of types of data (Good, Bad, Contaminated) under Example reports.
4. Run fastQC on the test2.fastq data as well as the test1.fastq data (from SRR039637 data we already downloaded).
5. View the contents of the resulting .zip file(s). All of these files can be viewed in the included .html file.
6. Open a new terminal window on your computer and use sftp or scp to copy the html link to your computer and view the report on a local browser.

Note: There is also a multiQC program on the server that you can use to compile all fastqc results from a single experiment together. This makes for easy comparisions between samples and/or replicates

See https://multiqc.info/.

To use it:

module load multiqc  # load multiqc software

multiqc --help  # view usage and option

11 Answers to in class exercises

11.1 In class exercise 1 (answers): FastQC


1) Use the sratoolkit to download SRR1552484 to your data folder.

cd data

module load sratoolkit

fastq-dump SRR1552484 -F --gzip
  1. Just as in class, move only the first 1000000 lines to a new file: call it test2.fastq
zcat SRR1552484.fastq.gz | head -1000000 > test2.fastq

3-4) Run fastQC on the test2.fastq data as well as the test1.fastq data (from SRR039637 data we already downloaded).

module load fastqc 

fastqc test1.fastq

fastqc test2.fastq
  1. View the contents of the resulting .zip file(s). All of these files can be viewed in the included .html file.

  2. Open a new terminal window on your computer and use sftp (or scp if you prefer) to copy the html link to your computer and view the report on the internet.

From your machine:

sftp meds5420usrX@transfer.cam.uchc.edu

get <path/to/html/file.html>